
grandpa: handle error from SelectChain::finality_target #5153

Merged
merged 7 commits into paritytech:master from andre/fix-grandpa-finality-target on Jul 28, 2024

Conversation

andresilva
Contributor

Fix #3487.

@andresilva
Contributor Author

See humanode-network/humanode#1104 (comment) and humanode-network/humanode#1104 (comment). I didn't think through all the consequences of this, but I think this is correct. This is what was happening before paritytech/substrate#13289, since we were just calling self.backend.blockchain().best_containing, which would look for the block in all forks (not just the best).

If someone could add a regression test for this I would appreciate it. Feel free to push to this PR or just create a new one.
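To make the intent concrete, here is a minimal sketch of the idea (illustrative shapes only, not the PR's actual diff; `best_containing` is a made-up helper name): when `SelectChain::finality_target` fails because the target sits on a fork that re-orged out of the best chain, degrade to voting on the target itself instead of propagating the error.

```rust
use sp_consensus::SelectChain;
use sp_runtime::traits::Block as BlockT;

// Sketch only: the real logic lives in the GRANDPA voter environment and is
// structured differently.
async fn best_containing<Block, SC>(select_chain: &SC, target: Block::Hash) -> Block::Hash
where
    Block: BlockT,
    SC: SelectChain<Block>,
{
    match select_chain.finality_target(target, None).await {
        Ok(best) => best,
        Err(err) => {
            // The target can sit on a fork that a BABE re-org just evicted
            // from the best chain. Pre-#13289, `best_containing` searched all
            // forks and tolerated this; falling back to the target keeps the
            // `grandpa-voter` essential task alive.
            log::warn!(target: "grandpa", "finality_target failed: {:?}, voting on target", err);
            target
        }
    }
}
```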

@dmitrylavrenov
Contributor

> See humanode-network/humanode#1104 (comment) and humanode-network/humanode#1104 (comment). I didn't think through all the consequences of this, but I think this is correct. This is what was happening before paritytech/substrate#13289, since we were just calling self.backend.blockchain().best_containing, which would look for the block in all forks (not just the best).
>
> If someone could add a regression test for this I would appreciate it. Feel free to push to this PR or just create a new one.

Will add the test.

@andresilva andresilva force-pushed the andre/fix-grandpa-finality-target branch from d401036 to b1447b9 Compare July 26, 2024 11:45
@MOZGIII left a comment

Looks good at first sight, but let me reconcile this logic with the LongestChain::best_chain implementation, as there might be more surprises there.

@MOZGIII left a comment

> Looks good at first sight, but let me reconcile this logic with the LongestChain::best_chain implementation, as there might be more surprises there.

That was quite trivial; there is still a possible race between the invocation of best_header in finality_target and best_chain, but I think this will not stall the consensus.
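A toy model of that race window (no substrate types, purely illustrative): the voter's two chain-selection reads are separate calls, so a re-org can land between them and the two answers can disagree.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Which fork is currently "best"; a re-org flips this between the two reads.
static BEST_FORK: AtomicU64 = AtomicU64::new(1);

fn best_chain() -> u64 {
    BEST_FORK.load(Ordering::SeqCst) // first read: voter picks a best header
}

fn finality_target() -> u64 {
    BEST_FORK.load(Ordering::SeqCst) // second read: may observe another fork
}

fn main() {
    let a = best_chain();
    BEST_FORK.store(2, Ordering::SeqCst); // BABE re-org lands in between
    let b = finality_target();
    assert_ne!(a, b); // the two answers disagree: that is the race window
    println!("best_chain saw fork {a}, finality_target saw fork {b}");
}
```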

@bkchr
Member

bkchr commented Jul 26, 2024

> Will add the test.

@dmitrylavrenov do you already have the test somewhere?

@bkchr bkchr marked this pull request as ready for review July 26, 2024 12:51
@dmitrylavrenov
Contributor

> > Will add the test.
>
> @dmitrylavrenov do you already have the test somewhere?

Yep, humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition

I can prepare a new PR against this one soon.

@bkchr
Member

bkchr commented Jul 26, 2024

Did you now change it so that it fails without the fix?

@dmitrylavrenov
Contributor

> Did you now change it so that it fails without the fix?

Yep, andresilva#1

dmitrylavrenov and others added 4 commits July 26, 2024 14:10
* Initial implementation of race condition of grandpa voting and babe reorg

* Add comments

* Remove toolchain

* Add finalization check logic

* Improve test logic
@andresilva
Contributor Author

@dmitrylavrenov @MOZGIII thanks for debugging and providing the test.

@dmitrylavrenov
Contributor

The test has been checked on my end (a minimal standalone sketch of the idea is below):

  • it failed on master
  • it succeeded with this fix
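For illustration, a tiny self-contained sketch of the regression idea (hypothetical code; the real test in andresilva#1 drives actual nodes through a BABE re-org): a failed chain-selection lookup must degrade to the target rather than surface as an error.

```rust
// Hypothetical stand-in for the voter's chain-selection step: on failure,
// fall back to the target instead of bubbling the error up.
fn vote_target(lookup: Result<u64, &'static str>, target: u64) -> u64 {
    lookup.unwrap_or(target)
}

#[test]
fn falls_back_to_target_when_lookup_fails() {
    // Simulates the re-org race: the block to finalize left the best chain.
    assert_eq!(vote_target(Err("target not on best chain"), 42), 42);
    // Normal case: selection succeeds and may return a later best block.
    assert_eq!(vote_target(Ok(50), 42), 50);
}
```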

@andresilva
Contributor Author

andresilva commented Jul 26, 2024

> > Looks good at first sight, but let me reconcile this logic with the LongestChain::best_chain implementation, as there might be more surprises there.
>
> That was quite trivial; there is still a possible race between the invocation of best_header in finality_target and best_chain, but I think this will not stall the consensus.

Indeed, that is what paritytech/substrate#13364 solved. There is no way to get around this race condition other than unifying best_chain and finality_target into a single asynchronous call. I don't remember the details, but I think that wasn't so trivial to do, so we went with that fix instead (i.e. setting best = target).
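The shape of that earlier mitigation, as a hedged sketch (names are illustrative; see paritytech/substrate#13364 for the real change): when the race leaves the selected best chain no longer containing the round target, clamp the vote to the target itself so the round can still complete.

```rust
// `best_descends_from_target` stands in for the ancestry check the real code
// performs against the blockchain backend.
fn round_best(best: u64, target: u64, best_descends_from_target: bool) -> u64 {
    if best_descends_from_target {
        best // best chain still contains the target: vote for the best block
    } else {
        target // a re-org raced us: set best = target, per the fix above
    }
}
```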

@bkchr
Member

bkchr commented Jul 26, 2024

> @dmitrylavrenov @MOZGIII thanks for debugging and providing the test.

Yeah! I can only second this! Really good work.

@bkchr bkchr added the T0-node This PR/Issue is related to the topic “node”. label Jul 26, 2024
@bkchr bkchr enabled auto-merge July 28, 2024 19:33
@bkchr bkchr added this pull request to the merge queue Jul 28, 2024
Merged via the queue into paritytech:master with commit dc4047c Jul 28, 2024
160 of 161 checks passed
bkchr added a commit that referenced this pull request Jul 29, 2024
Fix #3487.

---------

Co-authored-by: Dmitry Lavrenov <39522748+dmitrylavrenov@users.noreply.github.com>
Co-authored-by: Bastian Köcher <info@kchr.de>
Co-authored-by: Bastian Köcher <git@kchr.de>
EgorPopelyaev pushed a commit that referenced this pull request Jul 29, 2024
Fix #3487.

---------

Co-authored-by: André Silva <123550+andresilva@users.noreply.github.com>
Co-authored-by: Dmitry Lavrenov <39522748+dmitrylavrenov@users.noreply.github.com>
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024

Fix paritytech#3487.

---------

Co-authored-by: Dmitry Lavrenov <39522748+dmitrylavrenov@users.noreply.github.com>
Co-authored-by: Bastian Köcher <info@kchr.de>
Co-authored-by: Bastian Köcher <git@kchr.de>
@FlorianFranzen
Contributor

FlorianFranzen commented Aug 23, 2024

We just deployed this to our testnet as part of updating from 1.13.0 to 1.15.1, on only a single node to begin with, and it caused all our other nodes to crash with a message suggesting they now assume they are on a dead fork.

2024-08-23 08:54:36.428  WARN tokio-runtime-worker db::blockchain: Block 0x0553bc04843513fb40fb15b6c010dcf173845a1d6e1bb095b93699332a8ab84e exists in chain but not found when following all leaves backwards
2024-08-23 08:54:36.428 ERROR tokio-runtime-worker grandpa: GRANDPA voter error: could not complete a round on disk: Failed to set the chain head to a block that's too old.    
2024-08-23 08:54:36.428 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `grandpa-voter` failed. Shutting down service.    
Error: Service(Other("Essential task failed."))

We are still investigating the underlying issue; I just wanted to share this here in case anybody has an idea. We currently still hope it has to do with the stalled finality or with the way we configured our grandpa/node.

It should also be noted that we can no longer revert even a single finalized block on our node, as it complains that the block to be reverted is part of the chain but cannot be found going backwards from any of the leaves, which also suggests that the node assumes we are on a dead branch of some sort.

EDIT: The latest theory is that our finality lag just hit the pruning window and the real trigger was the nodes restarting. But the numbers do not really add up. And I always assumed only finalized blocks are pruned, unless maybe this patch broke that...

@MOZGIII

MOZGIII commented Aug 25, 2024

Thanks for sharing! Could you provide more details on what the initial conditions were when you started the deployment? Was the consensus stalled while the update was conducted?

@MOZGIII

MOZGIII commented Aug 25, 2024

This actually looks like a different issue; it seems like a quite long chain of unfinalized blocks would be evicted by the finality round that is about to run, to the point that the block the system is about to vote for was produced a very long time ago and got pruned. It could be all sorts of issues:

  • a bug in the block-pruning logic (you wouldn't want to prune unfinalized blocks, because this could happen) - this might be the real issue here; this patch doesn't change that, but other updates - even ones prior to the 1.13.0 you're starting from - might have, so you really have to dig deep
  • a longest-chain selection bug; thinking about it, this is very relevant for this fix and for consensus stalls, but there is no bug here; the thing is, if grandpa started a round (cast prevotes) and then stalled (like with the bug that this PR fixes), the target block for the finality algorithm gets fixed and then never changes, due to grandpa rules; so this means that if finality stalled a long time ago and this fix then got shipped, finality just got unstalled but stumbled upon the bug above; not really a problem with grandpa per se, as everything is going according to the rules

One thing I still don't get, though, is why we don't have the ability to set a delay of a couple of blocks for prevote target selection and casting. It seems like the configured finality delay (usually 2 blocks) is completely ignored when prevotes are cast. The root cause here could manifest so potently precisely because the longest-chain selection rule was given no time to converge the chain naturally, which is the whole purpose of that rule. In short, maybe there is another issue to report here.
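A purely hypothetical sketch of such a knob (none of these names exist in the codebase): lag the prevote target a configurable number of blocks behind the best head, giving the longest-chain rule time to converge before GRANDPA fixes a round target.

```rust
// Hypothetical: clamp the prevote target below the current best number.
fn prevote_target_number(best_number: u64, prevote_delay: u64) -> u64 {
    best_number.saturating_sub(prevote_delay)
}

fn main() {
    // With a 2-block delay at best head 1000 we would prevote for block 998,
    // leaving room for short-lived forks to be reorganized away first.
    assert_eq!(prevote_target_number(1000, 2), 998);
}
```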

@bkchr
Member

bkchr commented Aug 25, 2024

> EDIT: The latest theory is that our finality lag just hit the pruning window and the real trigger was the nodes restarting. But the numbers do not really add up. And I always assumed only finalized blocks are pruned, unless maybe this patch broke that...

How big is your finality lag?

Labels
T0-node This PR/Issue is related to the topic “node”.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Finalisation stopped: Encountered error finding best chain
8 participants